[BWARE] Tune compressed matmul fast paths and Spark execution decisions by Baunsgaard · Pull Request #2483 · apache/systemds

Baunsgaard · 2026-06-08T13:03:10Z

Mixes two related performance changes: refined compressed multiply heuristics, and a Spark-vs-CP decision refresh on the Hop layer.

CLALib matmul changes:

CLALibMMChain: for XtXv with few col groups and a wide-enough matrix, compute X' * X via leftMultByTransposeSelf and finish with a regular matrix multiply against v. Cheaper than chaining when the X' * X path can stay compressed
CLALibTSMM: refactor leftMultByTransposeSelf into a package-private helper so MMChain can call it; widen the ColGroupUncompressed handling
CLALibRightMultBy: stop forcing decompression for ASDC / ASDCZero inputs; they have working preAggregate paths that beat the dense fallback
CLALibCompAgg: fix blklen rounding so the last partition is not short by k rows on parallel aggregates

Spark/CP exec-decision refresh (Hop, UnaryOp, BinaryOp):

Hop: new helpers hasSparkOutput() and isScalarOrVectorBelowBlockSize() shared between unary and binary decision points
UnaryOp.optFindExecType: replace the inline chain of negations with isDisallowedSparkOps(), allow Frame outputs, and pull unary ops into Spark when the input already has a Spark output; gated on the ALLOW_TRANSITIVE_SPARK_EXEC_TYPE flag so it shares a kill-switch with the binary path
BinaryOp.optFindExecType: same kind of restructuring; allow matrix-or-frame outputs to be pulled into Spark when exactly one operand is a scalar or small vector

Instruction-side adjustments:

VariableCPInstruction (CAST_AS_MATRIX from frame): use the parallel MatrixBlockFromFrame.convertToMatrixBlock(fin, k) path instead of the single-threaded DataConverter helper
ParameterizedBuiltinCPInstruction (transformdecode): call the parallel decoder.decode(data, out, k) overload using InfrastructureAnalyzer.getLocalParallelism()
DecoderComposite (parallel decode): parallelize over row blocks instead of over decoders, so the sub-decoders still run in order within each block (e.g. recode-on-output depends on the category indexes from the preceding dummycode decoder); fall back to the sequential path for k <= 1

Testing:

New CompilerTestBase harness for compile-time exec-type assertions, plus SparkTransitiveExecTypeCompileTest and an end-to-end SparkTransitiveExecTypeTest (with DML) covering the unary/binary transitive Spark pull, its multi-consumer guard, the cumulative-op exclusion, and the flag-off kill-switch
CLALibMMChainTest, CLALibRightMultBySDCTest, and DecoderCompositeTest for the compressed matmul fast paths, ASDC right-multiply, and parallel composite decode; shared helpers added to CompressedTestBase

codecov · 2026-06-08T16:29:40Z

Codecov Report

❌ Patch coverage is 93.65079% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.45%. Comparing base (e4f0987) to head (f4ead24).
⚠️ Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
src/main/java/org/apache/sysds/hops/UnaryOp.java	72.72%	0 Missing and 3 partials ⚠️
src/main/java/org/apache/sysds/hops/BinaryOp.java	92.85%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2483      +/-   ##
============================================
- Coverage     71.47%   71.45%   -0.03%     
- Complexity    48883    48885       +2     
============================================
  Files          1573     1573              
  Lines        189238   189266      +28     
  Branches      37128    37137       +9     
============================================
- Hits         135261   135234      -27     
- Misses        43530    43572      +42     
- Partials      10447    10460      +13

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Mixes two related performance changes: refined compressed multiply heuristics, and a Spark-vs-CP decision refresh on the Hop layer. CLALib matmul changes: - CLALibMMChain: for XtXv with few col groups and a wide-enough matrix, compute X' * X via leftMultByTransposeSelf and finish with a regular matrix multiply against v. Cheaper than chaining when the X' * X path can stay compressed - CLALibTSMM: refactor leftMultByTransposeSelf into a package-private helper so MMChain can call it; widen the ColGroupUncompressed handling - CLALibRightMultBy: stop forcing decompression for ASDC / ASDCZero inputs; they have working preAggregate paths that beat the dense fallback - CLALibCompAgg: fix blklen rounding so the last partition is not short by k rows on parallel aggregates Spark/CP exec-decision refresh (Hop, UnaryOp, BinaryOp): - Hop: new helpers hasSparkOutput() and isScalarOrVectorBellowBlockSize() shared between unary and binary decision points - UnaryOp.optFindExecType: replace the inline chain of negations with isDisallowedSparkOps(), allow Frame outputs, and pull unary ops into Spark whenever the input already has a Spark output - BinaryOp.optFindExecType: same kind of restructuring; allow matrix-or-frame outputs to be pulled into Spark when exactly one operand is a scalar or small vector Instruction-side adjustments: - VariableCPInstruction (CAST_AS_MATRIX from frame): use the parallel MatrixBlockFromFrame.convertToMatrixBlock(fin, k) path instead of the single-threaded DataConverter helper - ParameterizedBuiltinCPInstruction (transformdecode): call the parallel decoder.decode(data, out, k) overload using InfrastructureAnalyzer.getLocalParallelism()

The multi-threaded DecoderComposite.decode submitted one task per decoder per row block, running all decoders concurrently. This broke the ordering dependency between decoders: recode-on-output reads the category indexes written by the dummycode decoder, so when the recode task raced ahead it read unwritten cells and produced null or the raw index instead of the original value. Parallelize over row blocks instead, running all decoders in order within each block via the sequential block decode. Also short-circuit to the single-threaded path when k <= 1. Fixes order-dependent failures in TransformFrameEncodeDecodeTest and TransformFrameEncodeColmapTest (dummycode single-node/hybrid) that surfaced once transformdecode started using the parallel decode path.

…isions Re-enable the betterIfDecompressed gate in CLALibRightMultBy so the decompressing right-multiply path stays reachable, while still excluding ASDC/ASDCZero column groups from forced decompression. Add targeted and end-to-end tests covering the recently tuned paths: - CLALibMMChainTest: the public CLALibTSMM.leftMultByTransposeSelf overload (wide/narrow/uncompressed/empty/reuse/null) and the XtXv mm-chain fast path, including a tile-then-recompress wide-chain case. - CLALibRightMultBySDCTest: right multiply on ASDC/ASDCZero inputs is not forced to decompress, single-threaded and parallel. - DecoderCompositeTest: parallel and single-thread composite decode, exercising the dummycode+recode ordering dependency. - SparkTransitiveExecTypeTest with DML scripts: UnaryOp/BinaryOp/Hop transitive Spark exec-type pulling under a constrained memory budget. - CompressedTestBase: two parameterized e2e cases that validate the TSMM overload and the wide XtXv fast path against uncompressed results across all compression configurations.

…e test Make the spark-specific decision refinement consistent between UnaryOp and BinaryOp: - UnaryOp: restore the input-is-not-checkpoint and single-parent guards, and drop the redundant `_etype != ExecType.SPARK` clause - BinaryOp: use the shared hasSparkOutput() helper instead of an inline optFindExecType() == SPARK check Add a compilation-verification test suite under component/compile that compiles a DML script into a runtime program and inspects instruction exec types without executing. CompilerTestBase provides the compile and plan-inspection helpers; SparkTransitiveExecTypeCompileTest verifies a CP-by-estimate unary on a Spark-resident input is pulled into Spark only when it is the sole consumer.

…e ops Extend SparkTransitiveExecTypeCompileTest to cover branches that were previously untested in the exec-type refinement: - BinaryOp transitive pull: a tall (multi-block) but byte-small Spark-resident column vector in a matrix-scalar op is pulled into Spark when it is the sole consumer, and stays CP when it has multiple consumers. - UnaryOp cumulative exclusion: a cumulative op on a Spark-resident input stays CP because cumulative ops are excluded from the transitive pull. These exercise Hop.isScalarOrVectorBellowBlockSize and hasSparkOutput across both branches. Inputs are consumed via indexing to avoid the sum(cumsum) algebraic rewrite that would otherwise eliminate the operator under test.

…nt nits Align the unary transitive Spark exec-type refinement with BinaryOp and clean up the new helpers introduced for the decision refresh: - Gate the UnaryOp transitive pull on the transitive flag so ALLOW_TRANSITIVE_SPARK_EXEC_TYPE acts as a real kill-switch for both unary and binary pulls. Default is true, so default behavior is unchanged; disabling the flag now keeps a pullable unary op in CP. - Rename Hop.isScalarOrVectorBellowBlockSize to isScalarOrVectorBelowBlockSize (spelling) and parenthesize the row/col clauses symmetrically. - Drop the unused k parameter from the private CLALibTSMM.addCorrectionLayer overload. - Simplify the redundant "v1 == true" in BinaryOp and fix a misleading comment on the disallowed-ops guard. - Add flag-off compile tests proving the kill-switch keeps an otherwise pullable unary and binary op in CP.

Shorten redundant inline comments added with the exec-type refinement: drop the parenthetical on the UnaryOp transitive guard, condense the compile-test class javadoc, and remove the noisy back-references and duplicated assert comments in the flag-off tests.

github-project-automation Bot added this to SystemDS PR Queue Jun 8, 2026

github-project-automation Bot moved this to In Progress in SystemDS PR Queue Jun 8, 2026

Baunsgaard force-pushed the split/mmChainTsmm branch 2 times, most recently from 4a9df3c to 85916ba Compare June 16, 2026 12:18

Baunsgaard added 4 commits June 17, 2026 13:29

Baunsgaard force-pushed the split/mmChainTsmm branch from f3b303c to e4dd03b Compare June 17, 2026 14:06

Baunsgaard added 3 commits June 23, 2026 15:45

Baunsgaard merged commit d71d469 into apache:main Jun 23, 2026
50 checks passed

github-project-automation Bot moved this from In Progress to Done in SystemDS PR Queue Jun 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BWARE] Tune compressed matmul fast paths and Spark execution decisions#2483

[BWARE] Tune compressed matmul fast paths and Spark execution decisions#2483
Baunsgaard merged 7 commits into
apache:mainfrom
Baunsgaard:split/mmChainTsmm

Baunsgaard commented Jun 8, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 8, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Baunsgaard commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Baunsgaard commented Jun 8, 2026 •

edited

Loading

codecov Bot commented Jun 8, 2026 •

edited

Loading